Abstract. During compilation of a program, register allocation is the task of
mapping program variables to machine registers. During register allocation, the 
compiler may introduce shuffle code, consisting of copy and swap operations, 
that transfers data between the registers. Three common sources of shuffle code 
are conflicting register mappings at joins in the control flow of the program, e.g.,
due to if-statements or loops; the calling convention for procedures, which often 
dictates that input arguments or results must be placed in certain registers; and 
machine instructions that only allow a subset of registers to occur as operands. 
Recently, Mohr et al. proposed to speed up shuffle code with special hardware
instructions that arbitrarily permute the contents of up to five registers and gave a 
heuristic for computing such shuffle codes. 

In this paper, we give an efficient algorithm for generating optimal shuffle code 
in the setting of Mohr et al. An interesting special case occurs when no register 
has to be transferred to more than one destination, i.e., it suffices to permute the 
contents of the registers. This case is equivalent to factoring a permutation into a 
minimal product of permutations, each of which permutes up to five elements. 


1 Introduction 

One of the most important tasks of a compiler during code generation is register
allocation, which is the task of mapping program variables to machine registers. During
this phase, it is frequently necessary to insert so-called shuffle code that transfers values 
between registers. Common reasons for the insertion of shuffle code are control flow 
joins, procedure calling conventions and constrained machine instructions. 

The specification of a shuffle code, i.e., a description of which register contents should
be transferred to which registers, can be formulated as a directed graph whose vertices
are the registers and in which an edge $(u, v)$ means that the content of $u$ before the
execution of the shuffle code must be in $v$ after the execution. Naturally, every vertex
must have at most one incoming edge. Note that vertices may have several outgoing edges,
indicating that their contents must be transferred to several destinations, and even
loops $(u, u)$, indicating that the content of register $u$ must be preserved. We call such
a graph a Register Transfer Graph or RTG. Two important special types of RTGs are
outdegree-1 RTGs, where the maximum out-degree is 1, and PRTGs, where
$\deg^-(v) = \deg^+(v) = 1$ for all vertices $v$ ($\deg^-$ and $\deg^+$ denote the in- and
out-degree of a vertex, respectively).

We say that a shuffle code, consisting of a sequence of copy and swap operations on 
the registers, implements an RTG if after the execution of the shuffle code every register 
whose corresponding vertex has an incoming edge has the correct content. The shuffle 
code generation problem asks for a shortest shuffle code that implements a given RTG. 



Fig. 1: Two example RTGs where the optimal shuffle code is not obvious.


The amount of shuffle code directly depends on the quality of copy coalescing, a
subtask of register allocation. As copy coalescing is NP-complete, reducing the
amount of shuffle code is expensive in terms of compilation time, and thus cannot be 
afforded in all contexts, e.g., just-in-time compilation. 

Therefore, it has been suggested to allow more complicated operations than simply 
copying and swapping to enable more efficient shuffle code. Mohr et al. propose to
allow performing permutations on the contents of small sets of up to five registers. The
processor they develop offers three instructions to implement shuffle code:

copy : copies the content of one register to another one
permi5 : cyclically shifts the contents of up to five registers
permi23 : swaps the contents of two registers and performs a cyclic shift of the contents
of up to three registers; the two sets of registers must be disjoint.

In fact, the two operations permi5 and permi23 together allow arbitrary permutations of
the contents of up to five registers in a single operation. Corresponding hardware and
a modified compiler that employs a greedy approach to generate the shuffle code have
been shown to improve performance in practice. While the greedy heuristic works
well in practice, it does not find an optimal shuffle code in all cases.

It is not obvious how to generate optimal shuffle code using the three instructions
copy, permi5 and permi23 even for small RTGs. In the left RTG from Fig. 1, a
naive solution would implement edges (1, 2) and (1, 3) using copies and the remaining
cycle (4 5 6) using a permi5. However, using one permi23 to implement the cycle
(4 5 6) and swap registers 1 and 2, and then copying register 2 to 3 requires only two
instructions. This is legal because the contents of register 1 can be overwritten. The
same trick is not applicable for the right RTG in Fig. 1 because of the loop (1, 1), and
hence three instructions are necessary to implement that RTG.
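
The two-instruction solution can be replayed on a toy register file. The following is a minimal sketch, not the authors' implementation: the edge list of the left RTG is inferred from the text above (the figure itself is not available here), and the helper names `apply_permutation`, `apply_copy` and `implements` are hypothetical.

```python
def apply_permutation(regs, moves):
    """Apply a permutation instruction; `moves` maps source register -> destination."""
    out = dict(regs)
    for src, dst in moves.items():
        out[dst] = regs[src]  # all reads use the state before the instruction
    return out

def apply_copy(regs, src, dst):
    """Copy the content of register `src` into register `dst`."""
    out = dict(regs)
    out[dst] = regs[src]
    return out

def implements(edges, initial, final):
    """An RTG is implemented if every edge (u, v) has the old content of u in v."""
    return all(final[v] == initial[u] for (u, v) in edges)

# Assumed edges of the left RTG: 1 -> 2, 1 -> 3 and the cycle (4 5 6).
edges = [(1, 2), (1, 3), (4, 5), (5, 6), (6, 4)]
initial = {r: f"v{r}" for r in range(1, 7)}

# One permi23-style step: swap registers 1 and 2, cyclically shift 4 -> 5 -> 6 -> 4.
after_perm = apply_permutation(initial, {1: 2, 2: 1, 4: 5, 5: 6, 6: 4})
# One copy: register 2 (now holding the old content of 1) to register 3.
final = apply_copy(after_perm, 2, 3)
```

After these two steps every edge of the assumed RTG is satisfied, while the permutation alone is not sufficient because register 3 still holds its old value.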

A maximum permutation size of 5 may seem arbitrary at first but is a consequence 
of instruction encoding constraints. In each permi instruction, the register numbers
and their order must be encoded in the instruction word. Hence, $\lceil \log_2(\binom{n}{k} k!) \rceil$ bits of
an instruction word are needed to be able to encode all permutations of k registers out of
n total registers. As many machine architectures use a fixed size for instruction words, 
e.g., 32 or 64 bits, and the operation type must also be encoded in the instruction word, 
space is very limited. In fact, for a 32 bit instruction word, 34 is the maximum number 
of registers that leave enough space for the operation type. 
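
The encoding argument can be checked numerically. This sketch assumes that the number of operand bits is the ceiling of the base-2 logarithm of the number of ordered selections of k registers out of n; the function name `permi_operand_bits` is hypothetical, and how many bits the operation type itself needs is not stated in the text.

```python
from math import perm

def permi_operand_bits(n, k):
    """Bits needed to distinguish all ordered selections of k out of n registers.

    perm(n, k) = n! / (n - k)! counts these selections; (m - 1).bit_length()
    equals ceil(log2(m)) for m >= 1 and avoids floating-point rounding.
    """
    return (perm(n, k) - 1).bit_length()

bits_34 = permi_operand_bits(34, 5)  # 34 registers, permutations of 5
bits_35 = permi_operand_bits(35, 5)  # one more register
```

Under this assumption, going from 34 to 35 registers costs one additional operand bit, which is consistent with 34 being a threshold for a 32-bit instruction word.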


Related Work. As long as only copy and swap operations are allowed, finding an
optimal shuffle code for a given RTG is a straightforward task [p. 56-57]. Therefore,
work in the area of compiler construction in this context has focused on coalescing
techniques that reduce the number and the size of RTGs.
From a theoretical point of view, the most closely related work studies the case
where the input RTG consists of a union of disjoint directed cycles, which can be
interpreted as a permutation $\pi$. Then, no copy operations are necessary for an optimal
shuffle code and hence the problem of finding an optimal shuffle code using permi23
and permi5 is equivalent to writing $\pi$ as a shortest product of permutations of maximum
size 5, where a permutation of $n$ elements has size $k$ if it fixes $n - k$ elements.

There has been work on writing a permutation as a product of permutations that
satisfy certain restrictions. The factorization problem on permutation groups from
computational group theory is the task of writing an element $g$ of a permutation group
as a product of given generators $S$. Hence, an algorithm for solving the factorization
problem could be applied in our context by using all possible permutations of size 5 or
less as the set $S$. However, the algorithms do not guarantee minimality of the product.
For the case that $S$ consists of all permutations that reverse a contiguous subsequence of
the elements, known as the pancake sorting problem, it has been shown that computing
a factoring of minimum size is NP-complete.

Farnoud and Milenkovic consider a weighted version of factoring a permutation
into transpositions. They present a polynomial constant-factor approximation algorithm 
for factoring a given permutation into transpositions where transpositions have arbitrary 
non-negative costs. For the case that the transposition costs are defined by a path-metric, 
they show how to compute a factoring of minimum weight in polynomial time. In our 
problem, we cannot assign costs to an individual transposition as its cost is context- 
dependent, e.g., four transpositions whose product is a cycle require one operation, 
whereas four arbitrary transpositions may require two. 


Contribution and Outline. In this paper, we present an efficient algorithm for generating
optimal shuffle code using the operations copy, permi5, and permi23, or
equivalently, using copy operations and permutations of size at most 5.

We first prove the existence of a special type of optimal shuffle codes whose copy
operations correspond to edges of the input RTG in Section 2. Removing the set of
edges implemented by copy operations from an RTG leaves an outdegree-1 RTG.

We show that the greedy algorithm proposed by Mohr et al. finds optimal shuffle
codes for outdegree-1 RTGs and that the size of an optimal shuffle code can be
expressed as a function that depends only on three characteristic numbers of the
outdegree-1 RTG rather than on its structure. Since PRTGs are a special case of
outdegree-1 RTGs, this shows that Greedy is a linear-time algorithm for factoring an
arbitrary permutation into a minimum number of permutations of size at most 5.

Finally, in Section 4, we show how to compute an optimal set of RTG edges that will
be implemented by copy operations such that the remaining outdegree-1 RTG admits a 
shortest shuffle code. This is done by several dynamic programs for the cases that the 
input RTG is disconnected, is a tree, or is connected and contains a (single) cycle. 

2 Register Transfer Graphs and Optimal Shuffle Codes 

In this section, we rephrase the shuffle code generation problem as a graph problem. An 
RTG that has only self-loops needs no shuffle-code and is called trivial. 
It is easy to define the effect of a permutation on an RTG. Let $G = (V, E)$ be an RTG and let
$\pi$ be an arbitrary permutation that is applied to the contents of the registers. We define
$\pi G = (V, \pi E)$, where $\pi E = \{(\pi(u), v) \mid (u, v) \in E\}$. This models the fact that if
$v$ should receive the data contained in $u$, then after $\pi$ moves the data contained in $u$ to
some other register, the data contained in $\pi(u)$ should end up in $v$. We observe
that for two permutations $\pi_1, \pi_2$ of $V$, we have $(\pi_2 \circ \pi_1)G = \pi_2(\pi_1 G)$, i.e., we have
defined a group action of the symmetric group on RTGs. For PRTGs, the shuffle code
generation problem asks for a shortest shuffle code that makes the given PRTG trivial.
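
The action of a permutation on an RTG can be written down directly from the definition. The sketch below assumes an RTG is represented as a set of edge pairs and a permutation as a dictionary from each register to the register its content is moved to; the names `act`, `compose` and `is_trivial` are illustrative only.

```python
def act(pi, edges):
    """Return the edge set of pi G: every edge (u, v) becomes (pi(u), v)."""
    return {(pi.get(u, u), v) for (u, v) in edges}

def compose(p2, p1, vertices):
    """Return p2 after p1 as a dictionary defined on all vertices."""
    return {x: p2.get(p1.get(x, x), p1.get(x, x)) for x in vertices}

def is_trivial(edges):
    """An RTG is trivial if it has only self-loops."""
    return all(u == v for (u, v) in edges)

vertices = {1, 2, 3}
cycle = {(1, 2), (2, 3), (3, 1)}      # a PRTG consisting of one 3-cycle
shift = {1: 2, 2: 3, 3: 1}            # move each content to its destination
swap12 = {1: 2, 2: 1}
swap23 = {2: 3, 3: 2}

resolved = act(shift, cycle)
lhs = act(compose(swap23, swap12, vertices), cycle)
rhs = act(swap23, act(swap12, cycle))
```

Moving every value along its edge turns all edges into loops, and applying two permutations one after the other gives the same RTG as applying their composition, which is the group-action property stated above.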

Unfortunately, it is not possible to directly express copy operations in RTGs. Instead,
we rely on the following observation. Consider an arbitrary shuffle code that
contains a copy $a \to b$ with source $a$ and target $b$ that is followed by a transposition $\tau$
of the contents of registers $c$ and $d$. We can replace this sequence by a transposition of
the registers $\{c, d\}$ and a copy $\tau(a) \to \tau(b)$. Thus, given a sequence of operations, we
can successively move the copy operations to the end of the sequence without increasing
its length. Hence, for any RTG there exists a shuffle code that consists of a pair of
sequences $((\pi_1, \dots, \pi_p), (c_1, \dots, c_t))$, where the $\pi_i$ are permutation operations and the
$c_i$ are copy operations. We now strengthen our assumption on the copy operations.
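
The exchange of a copy with a subsequent transposition can be verified exhaustively on a small register file. This is a sketch under the assumption that registers are list positions; all function names are hypothetical.

```python
from itertools import permutations

def copy(regs, a, b):
    """Copy register a into register b."""
    out = list(regs)
    out[b] = regs[a]
    return out

def swap(regs, c, d):
    """Exchange the contents of registers c and d."""
    out = list(regs)
    out[c], out[d] = regs[d], regs[c]
    return out

def tau(x, c, d):
    """The transposition (c d) applied to a register index."""
    if x == c:
        return d
    if x == d:
        return c
    return x

def copy_can_move_behind_swap(n):
    """Check copy-then-swap == swap-then-renamed-copy for all register choices."""
    regs = list(range(100, 100 + n))  # distinct dummy contents
    for a, b in permutations(range(n), 2):
        for c, d in permutations(range(n), 2):
            copy_first = swap(copy(regs, a, b), c, d)
            swap_first = copy(swap(regs, c, d), tau(a, c, d), tau(b, c, d))
            if copy_first != swap_first:
                return False
    return True
```

The check covers every choice of source, target and swapped pair, including the cases where the copy and the transposition share registers.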

Lemma 1. Every instance of the shuffle code generation problem has an optimal shuffle
code $((\pi_1, \dots, \pi_p), (c_1, \dots, c_t))$ such that

(i) No register occurs as both a source and a target of copy operations.

(ii) Every register is the target of at most one copy operation.

(iii) There is a bijection between the copy operations $c_i$ and the edges of $\pi G$ that are
not loops, where $\pi = \pi_p \circ \pi_{p-1} \circ \dots \circ \pi_1$.

(iv) If $u$ is the source of a copy operation, then $u$ is incident to a loop in $\pi G$.

(v) The number of copies is $\sum_{v \in V} \max\{\deg^+_G(v) - 1, 0\}$.

Proof. Consider an optimal shuffle code of the form $((\pi_1, \dots, \pi_p), (c_1, \dots, c_t))$ as
above and assume that the number $t$ of copy operations is minimal among all optimal
shuffle codes.

Suppose there exists a register that occurs as both a source and a target of copy
operations or a register that occurs as the target of more than one copy operation. Let $k$
be the smallest index such that in the sequence $c_1, \dots, c_k$ there is a register occurring
as both a source and a target or a register that occurs as a target of two copy operations.
We show that we can modify the sequence of copy operations such that the length of the
prefix without such registers increases. Inductively, we then obtain a sequence without
such registers. Let $v$ and $w$ denote the source and target of $c_k$, respectively. Let $i$ denote
the largest index such that $c_i$ is a copy operation that has $w$ as a source or target or such
that $c_i$ is a copy operation with target $v$. We distinguish three cases based on whether $c_i$
has target $v$, target $w$, or source $w$.

Case 1: The target of $c_i$ is $v$; see Fig. 2a. Let $u$ denote the source of operation $c_i$.
The sequence first copies a value from $u$ to $v$ and from there to $w$. We then replace $c_k$
by a copy with source $u$ and target $w$. (If $u = w$, we omit the operation altogether.) This
only changes the outcome of the shuffle code if the value contained in $u$ or $v$ is modified
between operations $c_i$ and $c_k$, i.e., if there exists a copy operation $c_j$ with $i < j < k$
whose target is either $u$ or $v$. But then already the smaller sequence $c_1, \dots, c_j$ has $u$ or $v$
occur as both a source and a target or as a target of two operations, contradicting the
minimality of $k$.

Fig. 2: Illustration of the proof of Lemma 1. The copies $c_j$ with $i < j < k$ along the
dashed edges would contradict either the choice of $i$ or $k$.

Case 2: The target of $c_i$ is $w$; see Fig. 2b. In this case the copy operation $c_i$ copies
a value to $w$ and later this value is overwritten by the operation $c_k$. Note that by the
choice of $i$ there is no operation $c_j$ with $i < j < k$ with source $w$. Thus, omitting the
copy operation $c_i$ does not change the outcome of the shuffle code. This contradicts
optimality.

Case 3: The source of $c_i$ is $w$; see Fig. 2c. Let $x$ denote the target of operation $c_i$.
In this case first a value is copied from $w$ to $x$ and later the value in $v$ is copied to $w$.
We claim that no copy operation $c_j$ with $i < j < k$ involves $x$ or $w$. If $x$ occurs as the
source of $c_j$ (as the target of $c_j$), then $x$ occurs as a source and target (two times as a
target) in the sequence $c_1, \dots, c_j$, contradicting the minimality of $k$. If $w$ is the target
of $c_j$, then $w$ occurs as a source and a target in the sequence $c_1, \dots, c_j$, contradicting
the choice of $k$. If $w$ is the source of $c_j$, we have a contradiction to the choice of $i$.
This proves the claim. We can thus, without changing the outcome of the shuffle code,
move the operation $c_i$ immediately before the operation $c_k$. Then our sequence contains
consecutive copy operations $w \to x$ and $v \to w$. Replace these two operations by a
cyclic shift of $v$, $w$ and $x$ and a copy operation $w \to v$. This decreases the number of
copy operations by 1 and thus contradicts the minimality of $t$.

Altogether, in each case, we have either found a contradiction to the optimality 
of the shuffle code, to the minimality of the number of copy operations or we have 
succeeded in producing a shuffle code that has a longer prefix satisfying properties (i) 
and (ii). Inductively, we obtain a shuffle code satisfying both (i) and (ii). Fix such a code. 
Since no register is both source and target of a copy operation, the copy operations are 
commutative and can be reordered arbitrarily without changing the result. 

For property (iii) first observe that the only way to transfer a value from $u$ to $v$ is via
a copy operation $u \to v$. This is due to the fact that the shuffle code is correct, that no
node occurs as both a source and a target of copy operations, and that $\pi$ only permutes
the values in the initial registers but does not duplicate them. Thus, for every edge there
must be a corresponding copy operation. Conversely, this number of copy operations
certainly suffices for a correct shuffle code for $\pi G$.

For property (iv) consider a copy operation from $u$ to $v$ such that $u$ is not incident
to a loop. If the in-degree of $u$ in $\pi G$ were 1, then there would be an incoming edge,
which would correspond to a copy operation with target $u$, which is not possible by
property (i). Thus, $u$ has in-degree 0. But then, the contents of $u$ are irrelevant and we
can replace the copy from $u$ to $v$ by an operation that swaps the contents of $u$ and $v$,
resulting in a shuffle code with fewer copy operations.
By property (iv) every vertex that is the source of an edge in $\pi G$ is incident to a loop.
Hence $\sum_{v \in V} \max\{\deg^+_{\pi G}(v) - 1, 0\}$ equals the number of non-loop edges in $\pi G$, which
is the same as the number of copy operations by property (iii). Note that by definition
$\pi$ only permutes the out-degrees of the vertices, and hence
$\sum_{v \in V} \max\{\deg^+_{\pi G}(v) - 1, 0\} = \sum_{v \in V} \max\{\deg^+_G(v) - 1, 0\}$.
This shows property (v) and finishes the proof.

□

We call a shuffle code satisfying the conditions of Lemma 1 normalized. Observe
that the number of copy operations used by a normalized shuffle code is a lower bound 
on the number of necessary copy operations since permutations, by definition, only 
permute values but never create copies of them. 

Consider now an RTG $G$ together with a normalized optimal shuffle code and one
of its copy operations $u \to v$. Since the code is normalized, the value transferred to $v$ by
this copy operation is the one that stays there after the shuffle code has been executed.
If $v$ had no incoming edge in $G$, then we could shorten the shuffle code by omitting the copy
operation. Thus, $v$ has an incoming edge $(u', v)$ in $G$, and we associate the copy $u \to v$
with the edge $(u', v)$ of $G$. In fact, $u' = \pi^{-1}(u)$, where $\pi = \pi_p \circ \dots \circ \pi_1$. In this way,
we associate every copy operation with an edge of the input RTG. In fact, this is an
injective mapping by Lemma 1(ii).

Lemma 2. Let $((\pi_1, \dots, \pi_p), (c_1, \dots, c_t))$ be an optimal shuffle code $S$ for an RTG
$G = (V, E)$ and let $C \subseteq E$ be the edges that are associated with copies in $S$. Then

(i) Every vertex $v$ has $\max\{\deg^+_G(v) - 1, 0\}$ outgoing edges in $C$.

(ii) $G - C$ is an outdegree-1 RTG.

(iii) $\pi_1, \dots, \pi_p$ is an optimal shuffle code for $G - C$.

Proof. For property (i) observe that, since permuting the register contents does not
duplicate values, it is necessary that at least $\max\{\deg^+_G(v) - 1, 0\}$ of the edges of $v$
are implemented by copy operations and thus are in $C$. By property (v) of Lemma 1,
the number of copy operations is exactly the sum of these values, which immediately
implies that equality holds at every vertex.

Property (ii) follows immediately from property (i).

Finally, for property (iii), suppose there is a shorter shuffle code $\pi'_1, \dots, \pi'_{p'}$
with $p' < p$ for $G - C$. Let $\pi' = \pi'_{p'} \circ \dots \circ \pi'_1$. Then $\pi' G$ has $|C|$ edges that are not loops
and by creating a copy operation for each of them we obtain a shorter shuffle code. This
is a contradiction to the optimality of the original shuffle code. Hence property (iii)
holds. □

Lemma 2 shows that an optimal shuffle code for an RTG $G$ can be found by first
picking for each vertex one of its outgoing edges (if it has any) and removing the
remaining edges from $G$, second finding an optimal shuffle code for the resulting outdegree-1
RTG, and finally creating one copy operation for each of the previously removed edges.
Fig. 3 shows that the choice of the outgoing edges is crucial to obtain an optimal shuffle
code.
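
The first step of this procedure, choosing one outgoing edge per vertex, can be enumerated by brute force for small inputs. The sketch below is only an illustration of the search space, not the efficient algorithm of the paper; the function names and the example edge list (the left RTG of Fig. 1 as inferred from the introduction) are assumptions.

```python
from collections import defaultdict
from itertools import product

def num_copies(edges):
    """Sum over all vertices of max(out-degree - 1, 0)."""
    outdeg = defaultdict(int)
    for u, _ in edges:
        outdeg[u] += 1
    return sum(max(d - 1, 0) for d in outdeg.values())

def candidate_copy_edge_sets(edges):
    """Yield, for every way of keeping one outgoing edge per vertex,
    the list of removed edges (those to be implemented by copies)."""
    outgoing = defaultdict(list)
    for e in edges:
        outgoing[e[0]].append(e)
    for kept in product(*outgoing.values()):
        kept_set = set(kept)
        yield [e for e in edges if e not in kept_set]

edges = [(1, 2), (1, 3), (4, 5), (5, 6), (6, 4)]
candidates = list(candidate_copy_edge_sets(edges))
```

The number of candidates is the product of the out-degrees, so this enumeration is exponential in general; it merely makes explicit what has to be optimized.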

In the following, we first show how to compute an optimal shuffle code for an
outdegree-1 RTG in Section 3. Afterwards, in Section 4, we design an algorithm for
efficiently determining a set of edges to be removed such that the resulting outdegree-1 
RTG admits a shuffle code with the smallest number of operations. 
(a) The original RTG $G$ needs one permutation
and one copy operation.

(b) After removing the edge (2, 3), the RTG
needs two permutation operations.

Fig. 3: The RTG $G$ admits the normalized optimal shuffle code $(\pi_1, c_1)$, where $\pi_1 =
(2\ 3\ 4\ 5\ 6)$ and $c_1 = 3 \to 1$. However, after removing the edge (2, 3) (instead of (1, 2))
we cannot achieve an optimal solution anymore.


3 Optimal Shuffle Code for Outdegree-1 RTGs 


In this section we prove the optimality of the greedy algorithm proposed by Mohr et
al. for outdegree-1 RTGs. Before we formulate the algorithm, let us look at the
effect of applying a transposition $\tau = (u\ v)$ to contiguous vertices of a $k$-cycle $K =
(V_K, E_K)$ in a PRTG $G$, where $k$-cycle denotes a cycle of size $k$. Hence, $u, v \in V_K$
and $(u, v) \in E_K$. Then, in $\tau G$, the cycle $K$ is replaced by a $(k - 1)$-cycle and a vertex
$v$ with a loop. We say that $\tau$ has reduced the size of $K$ by 1. If $\tau K$ is trivial, we say
that $\tau$ resolves $K$. It is easy to see that permi5 reduces the size of a cycle by up to 4
and permi23 reduces the sizes of two distinct cycles by 1 and up to 2, respectively.
We can now formulate Greedy as follows.

1. Complete each directed path of the input outdegree-1 RTG into a directed cycle, 
thereby turning the input into a PRTG. 

2. While there exists a cycle K of size at least 4, apply a permi5 operation to reduce
the size of K as much as possible.

3. While there exist a 2-cycle and a 3-cycle, resolve them with a permi23 operation.

4. Resolve pairs of 2-cycles by permi23 operations.

5. Resolve triples of 3-cycles by pairs of permi23 operations.
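
The five steps can be simulated on cycle sizes alone in order to count operations. This is a minimal sketch, not the authors' implementation: it takes the sizes of the paths and cycles as input (so step 1 is implicit), and it assumes that a leftover single 2-cycle or leftover 3-cycles that do not form a full triple are resolved with one operation each, which the step list does not spell out. The name `greedy_operation_count` is hypothetical.

```python
def greedy_operation_count(sizes):
    """Count the operations Greedy uses for paths/cycles of the given sizes."""
    ops = 0
    small = []
    # Step 2: one permutation of up to five registers shrinks a cycle by up to 4;
    # a cycle of size 4 or 5 is resolved completely (size 1 means trivial).
    for s in sizes:
        while s >= 4:
            s = max(s - 4, 1)
            ops += 1
        small.append(s)
    twos = small.count(2)
    threes = small.count(3)
    # Step 3: pair a 2-cycle with a 3-cycle.
    mixed = min(twos, threes)
    ops += mixed
    twos -= mixed
    threes -= mixed
    # Step 4: two 2-cycles per operation (assumed: a leftover one costs one operation).
    ops += (twos + 1) // 2
    # Step 5: three 3-cycles per two operations (assumed: leftovers cost one each).
    ops += (2 * threes + 2) // 3
    return ops
```

For example, one 2-cycle together with one 3-cycle needs a single operation, three 3-cycles need two, and an 8-cycle needs two.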

We claim that Greedy computes an optimal shuffle code. Let $G$ be an outdegree-1
RTG and let $Q$ denote the set of paths and cycles of $G$. For a path or cycle $\sigma \in Q$,
we denote by $\mathrm{size}(\sigma)$ the number of vertices of $\sigma$. Define
$X = \sum_{\sigma \in Q} \lfloor \mathrm{size}(\sigma)/4 \rfloor$ and
$a_i = |\{\sigma \in Q \mid \mathrm{size}(\sigma) \equiv i \pmod 4\}|$ for $i = 2, 3$. We call the triple
$\mathrm{sig}(G) = (X, a_2, a_3)$ the signature of $G$.

Lemma 3. Let $G$ be an outdegree-1 RTG with $\mathrm{sig}(G) = (X, a_2, a_3)$. The number
Greedy(G) of operations in the shuffle code produced by the greedy algorithm is
$\mathrm{Greedy}(G) = X + \max\{\lceil (a_2 + a_3)/2 \rceil, \lceil (a_2 + 2a_3)/3 \rceil\}$.

Proof. After the first step we have a PRTG with the same signature as $G$. Clearly,
Greedy produces exactly $X$ operations for reducing all cycle sizes below 4. Afterwards,
only permi23 operations are used to resolve the remaining cycles of size 2
and 3.

If $a_2 \ge a_3$, then first $a_3$ operations are used to resolve pairs of cycles of size 2 and 3.
Afterwards, the remaining $a_2 - a_3$ cycles of size 2 are resolved by using $\lceil (a_2 - a_3)/2 \rceil$
operations. In total, these are $\lceil (a_2 + a_3)/2 \rceil$ operations.

If $a_3 > a_2$, then first $a_2$ operations are used to resolve pairs of cycles of size 2 and 3.
Afterwards, the remaining $a_3 - a_2$ cycles of size 3 are resolved by using $\lceil 2(a_3 - a_2)/3 \rceil$
operations. In total, these are $\lceil (a_2 + 2a_3)/3 \rceil$ operations.

We observe that $(a_2 + a_3)/2 \le (a_2 + 2a_3)/3$ holds if and only if $a_2 \le a_3$ and that
equality holds for $a_2 = a_3$. Since $\lceil \cdot \rceil$ is a monotone function, this implies that the total
cost produced by the last part of the algorithm is $\max\{\lceil (a_2 + a_3)/2 \rceil, \lceil (a_2 + 2a_3)/3 \rceil\}$.

□
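
The signature and the closed formula of Lemma 3 are easy to evaluate. The following sketch assumes the input is the list of sizes of all paths and cycles of an outdegree-1 RTG; the function names are illustrative.

```python
def signature(sizes):
    """Return (X, a2, a3) for the given path/cycle sizes."""
    x = sum(s // 4 for s in sizes)
    a2 = sum(1 for s in sizes if s % 4 == 2)
    a3 = sum(1 for s in sizes if s % 4 == 3)
    return x, a2, a3

def greedy_length(sizes):
    """Evaluate X + max(ceil((a2 + a3) / 2), ceil((a2 + 2 * a3) / 3))."""
    x, a2, a3 = signature(sizes)
    # Integer ceilings: ceil(p / q) == (p + q - 1) // q for positive q.
    return x + max((a2 + a3 + 1) // 2, (a2 + 2 * a3 + 2) // 3)
```

For the sizes 6, 2, 3 and 9 the signature is (3, 2, 1) and the formula gives 5 operations; three 3-cycles give 2.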

In particular, the length of the shuffle code computed by Greedy only depends
on the signature of the input RTG $G$. In the remainder of this section, we prove that
Greedy is optimal for outdegree-1 RTGs and therefore the formula in Lemma 3
actually computes the length of an optimal shuffle code.

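Lemma 4. Let $G, G'$ be PRTGs with $\mathrm{sig}(G) = (X, a_2, a_3)$, $\mathrm{sig}(G') = (X', a_2', a_3')$
and $\mathrm{Greedy}(G) - \mathrm{Greedy}(G') \ge c$, and let $(\Delta_X, \Delta_2, \Delta_3) = \mathrm{sig}(G') - \mathrm{sig}(G)$. If
$a_2 \ge a_3$, then $2\Delta_X + \Delta_2 + \Delta_3 \le -2c + 1$. If $a_3 > a_2$, then $3\Delta_X + \Delta_2 + 2\Delta_3 \le -3c + 2$.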

Proof. We assume that $\mathrm{Greedy}(G) - \mathrm{Greedy}(G') \ge c$ and start with the case that
$a_2 \ge a_3$. By Lemma 3 and basic calculation rules for $\lceil \cdot \rceil$, we have the following.

$$\mathrm{Greedy}(G) = X + \lceil (a_2 + a_3)/2 \rceil \le X + (a_2 + a_3 + 1)/2$$
$$\mathrm{Greedy}(G') \ge X' + \lceil (a_2' + a_3')/2 \rceil \ge X + \Delta_X + (a_2 + a_3 + \Delta_2 + \Delta_3)/2$$

Therefore, their difference computes to

$$\mathrm{Greedy}(G) - \mathrm{Greedy}(G') \le -\Delta_X - (\Delta_2 + \Delta_3 - 1)/2 = -(2\Delta_X + \Delta_2 + \Delta_3 - 1)/2.$$

By assumption, we thus have $-(2\Delta_X + \Delta_2 + \Delta_3 - 1)/2 \ge c$, or equivalently
$2\Delta_X + \Delta_2 + \Delta_3 \le -2c + 1$.

Now consider the case $a_3 > a_2$. By Lemma 3, we have the following.

$$\mathrm{Greedy}(G) = X + \lceil (a_2 + 2a_3)/3 \rceil \le X + (a_2 + 2a_3 + 2)/3$$
$$\mathrm{Greedy}(G') \ge X' + \lceil (a_2' + 2a_3')/3 \rceil \ge X + \Delta_X + (a_2 + 2a_3 + \Delta_2 + 2\Delta_3)/3$$

Similar to above, their difference computes to

$$\mathrm{Greedy}(G) - \mathrm{Greedy}(G') \le -\Delta_X - (\Delta_2 + 2\Delta_3 - 2)/3 = -(3\Delta_X + \Delta_2 + 2\Delta_3 - 2)/3.$$

As above, by assumption we have $-(3\Delta_X + \Delta_2 + 2\Delta_3 - 2)/3 \ge c$, which is
equivalent to $3\Delta_X + \Delta_2 + 2\Delta_3 \le -3c + 2$. □

Lemma 4 gives us necessary conditions for when the Greedy solutions of two
RTGs differ by some value $c$. These necessary conditions depend only on the difference
of the two signatures. To study them more precisely, we define $\Psi_1(\Delta_X, \Delta_2, \Delta_3) =
2\Delta_X + \Delta_2 + \Delta_3$ and $\Psi_2(\Delta_X, \Delta_2, \Delta_3) = 3\Delta_X + \Delta_2 + 2\Delta_3$. Next, we study the effect
of a single transposition on these two functions.
Fig. 4: The transposition $\tau = (5\ 8)$ acting on PRTGs. Affected edges are drawn thick.
Read from left to right, the transposition is a merge; read from right to left, it is a split.

(a) Signature change $(\Delta_X, \Delta_2, \Delta_3)$. (b) Values of $\Psi_1$ (left) and $\Psi_2$ (right).

Table 1: Signature changes and $\Psi$ values for merges. Row and column are the cycle
sizes modulo 4 before the merge.


Let $G = (V, E)$ be a PRTG with $\mathrm{sig}(G) = (X, a_2, a_3)$ and let $\tau$ be a transposition
of two elements in $V$. We distinguish cases based on whether the swapped elements are
in different connected components or not. In the former case, we say that $\tau$ is a merge,
in the latter we call it a split; see Fig. 4 for an illustration.

We start with the merge operations as they are a bit simpler. When merging two
cycles of size $s_1$ and $s_2$, respectively, they are replaced by a single cycle of size $s_1 + s_2$.
Note that removing the two cycles may decrease the values $a_2$ and $a_3$ of the signature
by at most 2 in total. On the other hand, the new cycle can potentially increase one of
these values by 1. The value $X$ never decreases, and it increases by 1 if and only if
$(s_1 \bmod 4) + (s_2 \bmod 4) \ge 4$. Table 1a shows the possible signature changes $(\Delta_X, \Delta_2, \Delta_3)$
resulting from a merge. The entry in row $i$ and column $j$ shows the result of merging
two cycles whose sizes modulo 4 are $i$ and $j$, respectively. Table 1b shows the
corresponding values of $\Psi_1$ and $\Psi_2$. Only entries with $i \le j$ are shown; the remaining cases
are symmetric.
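
Since the table itself is not reproduced here, its entries can be recomputed from the rules just described. The sketch below derives the signature change of a merge from the two sizes modulo 4 and evaluates the two linear functions of the signature change defined earlier (called `psi1` and `psi2` in the code, with coefficients (2, 1, 1) and (3, 1, 2)); all names are illustrative and the output is a reconstruction, not a copy of the original table.

```python
def merge_change(i, j):
    """Signature change when merging cycles whose sizes are i and j modulo 4."""
    s = (i + j) % 4
    d_x = 1 if i + j >= 4 else 0
    d_2 = int(s == 2) - int(i == 2) - int(j == 2)
    d_3 = int(s == 3) - int(i == 3) - int(j == 3)
    return d_x, d_2, d_3

def psi1(d_x, d_2, d_3):
    return 2 * d_x + d_2 + d_3

def psi2(d_x, d_2, d_3):
    return 3 * d_x + d_2 + 2 * d_3

# One row per pair of residues i <= j: (signature change, psi1, psi2).
merge_table = {
    (i, j): (merge_change(i, j), psi1(*merge_change(i, j)), psi2(*merge_change(i, j)))
    for i in range(4) for j in range(i, 4)
}
```

In this reconstruction every entry of both functions is either 0 or 1, so no merge yields a negative value.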


Lemma 5. Let $G$ be a PRTG with $\mathrm{sig}(G) = (X, a_2, a_3)$ and let $\tau$ be a merge. Then
$\mathrm{Greedy}(G) \le \mathrm{Greedy}(\tau G)$.

Proof. Suppose $\mathrm{Greedy}(\tau G) < \mathrm{Greedy}(G)$. Then $\mathrm{Greedy}(G) - \mathrm{Greedy}(\tau G) \ge 1$
and by Lemma 4 either $\Psi_1 \le -1$ or $\Psi_2 \le -1$. However, Table 1b shows the values
of $\Psi_1$ and $\Psi_2$ for all possible merges. In all cases we have $\Psi_1, \Psi_2 \ge 0$, a contradiction. □


In particular, the lemma shows that merges never decrease the cost of the greedy 
solution, even if they were for free. We now make a similar analysis for splits. It is, 
however, obvious that splits indeed may decrease the cost of greedy solutions. In fact, 
one can always split cycles in a PRTG until it is trivial. 
Fig. 5: Transition graphs for $\Psi_1$ (left) and $\Psi_2$ (right).


First, we study again the effect of splits on the signature change $(\Delta_X, \Delta_2, \Delta_3)$.
Since a split is an inverse of a merge, we can essentially reuse Table 1a. If merging two
cycles whose sizes modulo 4 are $i$ and $j$, respectively, results in a signature change of
$(\Delta_X, \Delta_2, \Delta_3)$, then, conversely, we can split a cycle whose size modulo 4 is $i + j$ into
two cycles whose sizes modulo 4 are $i$ and $j$, respectively, such that the signature change
is $(-\Delta_X, -\Delta_2, -\Delta_3)$, and vice versa. Note that given a cycle whose size modulo 4 is
$s$ one has to look at all cells $(i, j)$ with $i + j \equiv s \pmod 4$ to consider all the possible
signature changes. Since $\Psi_1, \Psi_2$ are linear, negating the signature change also negates
the corresponding value. Thus, we can reuse Table 1b for splits by negating each entry.

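Lemma 6. Let $G = (V, E)$ be a PRTG and let $\pi$ be a cyclic shift of $c$ vertices in $V$. Let
further $(\Delta_X, \Delta_2, \Delta_3)$ be the signature change effected by $\pi$. Then
$\Psi_1(\Delta_X, \Delta_2, \Delta_3) \ge -\lceil (c - 1)/2 \rceil$ and $\Psi_2(\Delta_X, \Delta_2, \Delta_3) \ge -\lceil (3c - 3)/4 \rceil$.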


Proof. We can write $\pi = \tau_{c-1} \circ \dots \circ \tau_1$ as a product of $c - 1$ transpositions such that any
two consecutive transpositions $\tau_i$ and $\tau_{i+1}$ affect a common element.

Each transposition decreases $\Psi_1$ (or $\Psi_2$) by at most 1, but a decrease happens only
for certain split operations. However, it is not possible to reduce $\Psi_1$ (or $\Psi_2$) with every
single transposition since for two consecutive splits the second has to split one of the
connected components resulting from the previous splits. To get an overview of the
sequences of splits that reduce the value of $\Psi_1$ (or of $\Psi_2$) by 1 for each split, we consider
the following transition graphs $T_k$ for $\Psi_k$ ($k = 1, 2$) on the vertex set $S = \{0, 1, 2, 3\}$.
In the graph $T_k$ there is an edge from $i$ to $j$ if there is a split that splits a component
of size $i$ mod 4 such that one of the resulting components has size $j$ mod 4 and this
split decreases $\Psi_k$ by 1. The transition graphs $T_1$ and $T_2$ are shown in Fig. 5.

For $\Psi_1$ the longest path in the transition graph has length 1. Thus, the value of $\Psi_1$ can
be reduced at most every second transposition and $\Psi_1(\Delta_X, \Delta_2, \Delta_3) \ge -\lceil (c - 1)/2 \rceil$.

For $\Psi_2$ the longest path has length 3 (vertex 1 has out-degree 0). Therefore, after at
most three consecutive steps that decrease $\Psi_2$, there is one that does not. It follows that
at least $\lfloor (c - 1)/4 \rfloor$ operations do not decrease $\Psi_2$ and consequently at most $\lceil (3c - 3)/4 \rceil$
operations decrease $\Psi_2$ by 1. Thus, $\Psi_2(\Delta_X, \Delta_2, \Delta_3) \ge -\lceil (3c - 3)/4 \rceil$. □
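
The two transition graphs can be rebuilt from the merge rules and their longest paths computed. This is a sketch under the assumption that a split decreases one of the two functions by 1 exactly when the corresponding merge has value 1 for that function; the coefficient tuples (2, 1, 1) and (3, 1, 2) are those of the two functions defined earlier, and all names are hypothetical.

```python
from functools import lru_cache

def merge_change(i, j):
    """Signature change when merging cycles whose sizes are i and j modulo 4."""
    s = (i + j) % 4
    d_x = 1 if i + j >= 4 else 0
    d_2 = int(s == 2) - int(i == 2) - int(j == 2)
    d_3 = int(s == 3) - int(i == 3) - int(j == 3)
    return d_x, d_2, d_3

def transition_graph(coeffs):
    """Edges s -> i and s -> j for every split of residue s into residues i, j
    whose (negated) merge value under the linear function `coeffs` is -1."""
    graph = {r: set() for r in range(4)}
    for i in range(4):
        for j in range(4):
            value = sum(c * d for c, d in zip(coeffs, merge_change(i, j)))
            if value == 1:
                graph[(i + j) % 4].update((i, j))
    return graph

def longest_path(graph):
    """Length (number of edges) of a longest path; the graphs here are acyclic."""
    @lru_cache(maxsize=None)
    def depth(v):
        return max((1 + depth(w) for w in graph[v]), default=0)
    return max(depth(v) for v in graph)

t1 = transition_graph((2, 1, 1))
t2 = transition_graph((3, 1, 2))
```

In this reconstruction the first graph has longest path length 1 and the second has longest path length 3, with residue 1 having no outgoing edge, matching the statements used in the proof.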

Since permi5 performs a single cyclic shift and permi23 is the concatenation
of two cyclic shifts, Lemmas 4 and 6 can be used to show that no such operation may
decrease the number of operations Greedy has to perform by more than 1.

Corollary 1. Let $G$ be a PRTG and let $\pi$ be an operation, i.e., either a permi23 or a
permi5. Then $\mathrm{Greedy}(G) \le \mathrm{Greedy}(\pi G) + 1$.
Proof. Assume for a contradiction that $\mathrm{Greedy}(G) > \mathrm{Greedy}(\pi G) + 1$. By Lemma 4,
we have that either $\Psi_1(\Delta_X, \Delta_2, \Delta_3) \le -3$ or $\Psi_2(\Delta_X, \Delta_2, \Delta_3) \le -4$.

We distinguish cases based on whether $\pi$ is a permi5 or a permi23. If $\pi$ is a
permi5, then it is a $c$-cycle with $c \le 5$. By Lemma 6, we have that $\Psi_1(\Delta_X, \Delta_2, \Delta_3) \ge -2$
and $\Psi_2(\Delta_X, \Delta_2, \Delta_3) \ge -3$. This contradicts the above bounds from Lemma 4.

If $\pi$ is a permi23, then it is a composition of a 2-cycle and a $c$-cycle with $c \le 3$.
According to Lemma 6, both cycles contribute at least $-1$ to $\Psi_1$, and at least $-1$ and $-2$
to $\Psi_2$. Therefore, we have $\Psi_1(\Delta_X, \Delta_2, \Delta_3) \ge -2$ and $\Psi_2(\Delta_X, \Delta_2, \Delta_3) \ge -3$. This is
again a contradiction. □

Using this corollary and an induction on the length of an optimal shuffle code, we
show that Greedy is optimal for PRTGs; if no operation reduces the number of operations
Greedy needs by more than 1, why not use the operation suggested by Greedy?

Theorem 1. Let G be a PRTG. An optimal shuffle code for G takes Greedy(G) operations.
Algorithm Greedy computes an optimal shuffle code in linear time.

Proof. The proof is by induction on the overall length of an optimal shuffle code.
Clearly, Greedy computes optimal shuffle codes for all instances that have a shuffle
code of length 0.

Assume that $G$ admits an optimal shuffle code of length $k + 1$. We show that
$\mathrm{Greedy}(G) = k + 1$. First of all, note that $\mathrm{Greedy}(G) \ge k + 1$ as it computes a shuffle
code of length $\mathrm{Greedy}(G)$. Let $\pi_1, \dots, \pi_{k+1}$ be such a shuffle code for $G$. Then obviously
$\pi_{k+1} G$ admits an optimal shuffle code of length $k$, and therefore $\mathrm{Greedy}(\pi_{k+1} G) = k$
by our inductive assumption. Corollary 1 implies $\mathrm{Greedy}(G) \le \mathrm{Greedy}(\pi_{k+1} G) + 1
= k + 1$, which completes the induction step.

Clearly, algorithm Greedy indeed computes a correct, and thus optimal, shuffle 
code. It can easily be implemented to run in linear time. □ 

Moreover, since merge operations may not decrease the cost of Greedy and any
PRTG that can be formed from the original outdegree-1 RTG $G$ by inserting edges can
be obtained from the PRTG $G'$ formed by Greedy and a sequence of merge operations,
it follows that the length of an optimal shuffle code for $G$ is $\mathrm{Greedy}(G')$.

Lemma 7. Let $G$ be an outdegree-1 RTG and let $G'$ be the PRTG formed by completing
each directed path into a directed cycle. Then the length of an optimal shuffle code of
$G$ is $\mathrm{Greedy}(G')$.

Proof. Assume $\pi_1, \dots, \pi_k$ is an optimal shuffle code for $G$. Of course, applying
$\pi = \pi_k \circ \dots \circ \pi_1$ to $G$ maps every value of $G$ somewhere, that is, $\pi_1, \dots, \pi_k$ is actually an
optimal shuffle code for some instance $G''$ that consists of a disjoint union of directed
cycles and contains $G$ as a subgraph. It is not hard to see that $G''$ can be obtained from
$G'$ by a sequence of merge operations $\tau_1, \dots, \tau_t$, i.e., $G'' = \tau_t \circ \dots \circ \tau_1 G'$. Since merge
operations do not decrease the cost of Greedy, we have
$\mathrm{Greedy}(G') \le \mathrm{Greedy}(\tau_1 G') \le \dots \le \mathrm{Greedy}(\tau_t \circ \dots \circ \tau_1 G') = \mathrm{Greedy}(G'') = k$,
where the last equality follows from Theorem 1, the optimality of
Greedy for PRTGs. □
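The statement of Lemma 7 can be illustrated with a minimal Python sketch. It assumes that the signature $(X, a_2, a_3)$ defined earlier in the paper is obtained from the component sizes as $X = \sum \lfloor \text{size}/4 \rfloor$ and $a_i$ = number of components whose size is congruent to $i$ modulo 4; completing a path into a cycle keeps its number of vertices, so only the sizes matter. The function names and the successor-map representation are hypothetical and chosen for illustration only.

```python
def component_sizes(succ, vertices):
    """Sizes of the paths and cycles of an outdegree-1 RTG.

    succ maps a vertex to its unique successor; every vertex is assumed
    to have in-degree at most 1, as in an RTG.
    """
    has_pred = set(succ.values())
    seen, sizes = set(), []
    # Directed paths start at vertices without an incoming edge.
    for start in vertices:
        if start in has_pred:
            continue
        size, v = 0, start
        while v is not None:
            seen.add(v)
            size += 1
            v = succ.get(v)
        sizes.append(size)
    # All remaining vertices lie on directed cycles.
    for start in vertices:
        if start in seen:
            continue
        size, v = 0, start
        while v not in seen:
            seen.add(v)
            size += 1
            v = succ[v]
        sizes.append(size)
    return sizes


def signature(sizes):
    """Assumed signature (X, a2, a3) computed from component sizes."""
    X = sum(s // 4 for s in sizes)
    a2 = sum(1 for s in sizes if s % 4 == 2)
    a3 = sum(1 for s in sizes if s % 4 == 3)
    return X, a2, a3


def greedy_length(succ, vertices):
    """X + max(ceil((a2 + a3) / 2), ceil((a2 + 2 * a3) / 3))."""
    X, a2, a3 = signature(component_sizes(succ, vertices))
    return X + max(-(-(a2 + a3) // 2), -(-(a2 + 2 * a3) // 3))


# Path 0 -> 1 -> 2, cycle 3 <-> 4, path 5 -> 6, isolated vertex 7.
example = {0: 1, 1: 2, 3: 4, 4: 3, 5: 6}
length = greedy_length(example, range(8))
```

In this example the components have sizes 3, 2, 2 and 1, so under the stated assumption the signature is $(0, 2, 1)$ and two operations suffice.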

By combining Theorem 1 and Lemma 7 we obtain the main result of this section.

Theorem 2. Let $G$ be an outdegree-1 RTG. Then an optimal shuffle code for $G$ requires
$\mathrm{Greedy}(G)$ operations. Greedy computes such a shuffle code in linear time.

4 The General Case


In this section we study the general case. A copy set of an RTG $G = (V, E)$ is a
set $C \subseteq E$ such that $G - C = (V, E - C)$ is an outdegree-1 RTG and
$|C| = \sum_{v \in V} \max\{\deg^+(v) - 1, 0\}$. We denote by $\mathcal{C}(G)$ the set of all copy sets of $G$.
According to Lemma 1, an optimal shuffle code for $G$ can be found by finding a copy set
$C \in \mathcal{C}(G)$ such that the outdegree-1 RTG $G - C$ admits a shortest shuffle code. By
Theorem 2, an optimal shuffle code for $G - C$ can be computed with the greedy algorithm
and its length can be computed from the signature of $G - C$.

We thus seek a copy set $C \in \mathcal{C}(G)$ that minimizes the cost function
$\mathrm{Greedy}(G - C) = X + \max\{\lceil (a_2 + a_3)/2 \rceil, \lceil (a_2 + 2a_3)/3 \rceil\}$, where $(X, a_2, a_3)$ is the signature of
$G - C$. Such a copy set is called optimal. Clearly, this is equivalent to minimizing the
function

$$\mathrm{Greedy}'(G - C) = X + \max\left\{\frac{a_2 + a_3}{2}, \frac{a_2 + 2a_3}{3}\right\} =
\begin{cases}
X + \dfrac{a_2 + a_3}{2} & \text{if } a_2 \ge a_3, \\[2mm]
X + \dfrac{a_2 + 2a_3}{3} & \text{if } a_2 < a_3.
\end{cases}$$

To keep track of which case is used for evaluating $\mathrm{Greedy}'$, we define $\mathrm{diff}(G - C) = a_2 - a_3$
and compute for each of the two function parts and every possible value $d$ a
copy set $C_d$ with $\mathrm{diff}(G - C_d) = d$ that minimizes that function.
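The relation between the rounded and the unrounded cost function can be checked with a short Python sketch. It only uses the two formulas stated above; the function names are hypothetical. Exact rational arithmetic avoids rounding issues with thirds.

```python
from fractions import Fraction
from math import ceil


def greedy_cost(X, a2, a3):
    """X + max(ceil((a2 + a3) / 2), ceil((a2 + 2 * a3) / 3))."""
    return X + max(ceil(Fraction(a2 + a3, 2)), ceil(Fraction(a2 + 2 * a3, 3)))


def relaxed_cost(X, a2, a3):
    """Unrounded cost, selecting the case by the sign of diff = a2 - a3."""
    if a2 - a3 >= 0:
        return X + Fraction(a2 + a3, 2)
    return X + Fraction(a2 + 2 * a3, 3)


# Rounding the relaxed cost up recovers the Greedy cost on a small grid.
agree = all(
    ceil(relaxed_cost(X, a2, a3)) == greedy_cost(X, a2, a3)
    for X in range(4) for a2 in range(8) for a3 in range(8)
)
```

Because rounding up is monotone, a copy set minimizing the relaxed cost also minimizes the rounded cost, which is why the case distinction by $\mathrm{diff}$ is sufficient.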

More formally, we define $\mathrm{cost}^1(G - C) = X + \frac{1}{2} a_2 + \frac{1}{2} a_3$ and
$\mathrm{cost}^2(G - C) = X + \frac{1}{3} a_2 + \frac{2}{3} a_3$ and we seek two tables $T^1_G[\cdot]$, $T^2_G[\cdot]$, such that $T^i_G[d]$ is the
smallest cost $\mathrm{cost}^i(G - C)$ that can be achieved with a copy set $C \in \mathcal{C}(G)$ with $\mathrm{diff}(G - C) = d$.
We observe that $T^i_G[d] = \infty$ for $d < -n$ and for $d > n$. The following lemma shows
that the length of an optimal shuffle code can be computed from these two tables.


Lemma 8. Let $G = (V, E)$ be an RTG. The length of an optimal shuffle code for $G$ is
$\sum_{v \in V} \max\{\deg^+(v) - 1, 0\} + \min\{\min_{d \ge 0} \lceil T^1_G[d] \rceil, \min_{d < 0} \lceil T^2_G[d] \rceil\}$.

Proof. Let $m = \sum_{v \in V} \max\{\deg^+(v) - 1, 0\}$. Consider an optimal normalized shuffle
code for $G$, which, as noted at the beginning of this section, consists of a copy set $C \subseteq E$ and a sequence
of $k$ permutation operations, i.e., the length of the shuffle code is $m + k$. Let $(X, a_2, a_3)$
denote the signature of $G - C$ and let $d = a_2 - a_3$. If $a_2 \ge a_3$, or equivalently $d \ge 0$,
then according to Theorem 2 we have $k = \mathrm{Greedy}(G - C) = X + \lceil (a_2 + a_3)/2 \rceil =
\lceil X + (a_2 + a_3)/2 \rceil = \lceil \mathrm{cost}^1(G - C) \rceil$, and therefore the length of the shuffle code
is at least $m + \lceil T^1_G[d] \rceil$. If $a_2 < a_3$, i.e., if $d < 0$, then we have $k = \mathrm{Greedy}(G - C) =
X + \lceil (a_2 + 2a_3)/3 \rceil = \lceil X + (a_2 + 2a_3)/3 \rceil = \lceil \mathrm{cost}^2(G - C) \rceil$, and therefore the
length of the shuffle code is at least $m + \lceil T^2_G[d] \rceil$. In either case the length of the shuffle
code is bounded from below by the expression given in the statement of the lemma.

Conversely, assume that the minimum of the expression is obtained for some value
$T^i_G[d]$. If $d < 0$ (resp. if $d \ge 0$), there exists a copy set $C$ such that $\mathrm{sig}(G - C) =
(X, a_2, a_3)$ and $\mathrm{Greedy}(G - C) = \lceil \mathrm{cost}^2(G - C) \rceil$ (resp. $\mathrm{Greedy}(G - C) =
\lceil \mathrm{cost}^1(G - C) \rceil$) is at most $\lceil T^2_G[d] \rceil$ (resp. at most $\lceil T^1_G[d] \rceil$). Then, clearly, the shuffle code
defined by $C$ and Greedy applied to $G - C$ has length at most $m + \lceil T^2_G[d] \rceil$ (resp.
$m + \lceil T^1_G[d] \rceil$). □

In the following, we show how to compute for an RTG $G$ a table $T_G[\cdot]$ with

$$T_G[d] = \min_{\substack{C \in \mathcal{C}(G) \\ \mathrm{diff}(G - C) = d}} \mathrm{cost}(G - C)$$

for an arbitrary cost function $\mathrm{cost}(G - C) = c(\mathrm{sig}(G - C))$, where $c$ is a linear function.
This is done in several steps depending on whether $G$ is disconnected, is a tree, or is
connected and contains a cycle. Before we continue, we introduce several preliminaries
to simplify the following calculations. We denote by $P_s$ a directed path on $s$ vertices.
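As a reference point for the dynamic program developed below, the tables can also be computed directly from their definition by enumerating all copy sets, which is exponential but useful for small instances. The following Python sketch does this and then evaluates the expression of Lemma 8. It assumes that the signature is derived from component sizes as $X = \sum \lfloor \text{size}/4 \rfloor$ and $a_i$ = number of components of size congruent to $i$ modulo 4; all function names and the edge-list representation are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import ceil


def signature(succ, vertices):
    """Assumed signature (X, a2, a3) of an outdegree-1 RTG (successor map)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in succ.items():
        parent[find(u)] = find(v)
    sizes = {}
    for v in vertices:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    X = sum(s // 4 for s in sizes.values())
    a2 = sum(1 for s in sizes.values() if s % 4 == 2)
    a3 = sum(1 for s in sizes.values() if s % 4 == 3)
    return X, a2, a3


def tables(edges, vertices):
    """Brute force: number of copies m and the tables T^1, T^2 as dicts d -> cost."""
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    sources = sorted(out)
    T1, T2 = {}, {}
    # A copy set keeps exactly one outgoing edge per vertex with successors.
    for kept in product(*(out[u] for u in sources)):
        X, a2, a3 = signature(dict(zip(sources, kept)), vertices)
        d = a2 - a3
        c1 = X + Fraction(a2 + a3, 2)
        c2 = X + Fraction(a2 + 2 * a3, 3)
        T1[d] = min(T1.get(d, c1), c1)
        T2[d] = min(T2.get(d, c2), c2)
    m = sum(len(targets) - 1 for targets in out.values())
    return m, T1, T2


def optimal_length(m, T1, T2):
    """Expression of Lemma 8; entries missing from a dict count as infinity."""
    candidates = [ceil(c) for d, c in T1.items() if d >= 0]
    candidates += [ceil(c) for d, c in T2.items() if d < 0]
    return m + min(candidates)


edges = [(0, 1), (0, 2), (2, 3), (2, 4), (4, 5)]
m, T1, T2 = tables(edges, range(6))
best = optimal_length(m, T1, T2)
```

For this small tree, two copies are unavoidable, and keeping the path 0, 2, 4, 5 leaves a single component of size 4, so one permutation operation suffices and the sketch reports a total length of 3.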

Definition 1. A map $f$ that assigns a value to an outdegree-1 RTG is signature-linear
if there exists a linear function $g \colon \mathbb{R}^3 \to \mathbb{R}$ such that $f(G) = g(\mathrm{sig}(G))$ for every
outdegree-1 RTG $G$. For a signature-linear function $f$, $\Delta_f(s) = f(P_{s+1}) - f(P_s)$ is
the correction term.

Note that both $\mathrm{cost} = c \circ \mathrm{sig}$ and $\mathrm{diff} = d \circ \mathrm{sig}$ with $d(X, a_2, a_3) = a_2 - a_3$ are
signature-linear. The correction term $\Delta_f(s)$ describes the change of $f$ when the size of
one connected component is increased from $s$ to $s + 1$.

Lemma 9. Let $f$ be a signature-linear function. Then the following hold:

(i) $f(G_1 \cup G_2) = f(G_1) + f(G_2)$ for disjoint outdegree-1 RTGs $G_1, G_2$.

(ii) Let $G = (V, E)$ be an outdegree-1 RTG and let $v \in V$ with in-degree 0. Denote by
$s$ the size of the connected component containing $v$ and let $G^+ = (V \cup \{u\}, E \cup
\{(u, v)\})$ where $u$ is a new vertex. Then $f(G^+) = f(G) + \Delta_f(s)$.

Proof. For (i), observe that $\mathrm{sig}(G_1 \cup G_2) = \mathrm{sig}(G_1) + \mathrm{sig}(G_2)$; then the statement
follows from the signature-linearity of $f$.

For (ii), observe that by adding $u$, we replace a connected component of size $s$ by
one of size $s + 1$. Thus $\mathrm{sig}(G^+) = \mathrm{sig}(G) - \mathrm{sig}(P_s) + \mathrm{sig}(P_{s+1})$. The statement follows
from the signature-linearity of $f$ and the definition of $\Delta_f(s)$. □

Note that $\Delta_f(s) = \Delta_f(s + 4)$ for all values of $s$ and hence it suffices to know the size
of the enlarged component modulo 4.
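The periodicity of the correction term can be made concrete with a small Python sketch. It assumes that a path on $s$ vertices has signature $(\lfloor s/4 \rfloor, [s \bmod 4 = 2], [s \bmod 4 = 3])$, which matches the role of the signature in the Greedy cost formula but is not restated in this section; the helper names are hypothetical.

```python
from fractions import Fraction


def path_signature(s):
    """Assumed signature (X, a2, a3) of a directed path on s vertices."""
    return (s // 4, int(s % 4 == 2), int(s % 4 == 3))


def correction(g, s):
    """Delta_f(s) = f(P_{s+1}) - f(P_s) for f = g o sig."""
    return g(*path_signature(s + 1)) - g(*path_signature(s))


def diff(X, a2, a3):
    return a2 - a3


def cost1(X, a2, a3):
    return X + Fraction(a2 + a3, 2)


# Correction terms for component sizes 1, 2, 3, 4.
diff_terms = [correction(diff, s) for s in range(1, 5)]
cost_terms = [correction(cost1, s) for s in range(1, 5)]
```

Under this assumption, growing a path from $s$ to $s + 4$ vertices only adds $(1, 0, 0)$ to its signature, so the differences between consecutive paths repeat with period 4.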

The main idea for computing table $T_G[\cdot]$ by dynamic programming is to decompose
$G$ into smaller edge-disjoint subgraphs $G = G_1 \cup \dots \cup G_k$ such that the copy sets of
$G$ can be constructed from copy sets for each of the $G_i$. We call such a decomposition a
proper partition if for every vertex $v$ of $G$ there exists an index $i$ such that $G_i$ contains
all outgoing edges of $v$. Let $G_1, \dots, G_k$ be a proper partition of $G$ and let $\mathcal{C}_i \subseteq \mathcal{C}(G_i)$
for $i = 1, \dots, k$. We define $\mathcal{C}_1 \otimes \dots \otimes \mathcal{C}_k = \{C_1 \cup \dots \cup C_k \mid C_i \in \mathcal{C}_i, i = 1, \dots, k\}$.
It is not hard to see that $\mathcal{C}(G_1 \cup \dots \cup G_k) = \mathcal{C}(G_1) \otimes \dots \otimes \mathcal{C}(G_k)$.

4.1 Disconnected RTGs 

We start with the case that $G$ is disconnected and consists of connected components
$G_1, \dots, G_k$, which form a proper partition of $G$. The main issue is to keep track of
diff and cost. For an RTG $G$, we define $\mathcal{C}(G; d) = \{C \in \mathcal{C}(G) \mid \mathrm{diff}(G - C) = d\}$.
By Lemma 9(i) and the signature-linearity of diff, if $C_i \in \mathcal{C}(G_i; d_i)$ for $i = 1, 2$, then
$C_1 \cup C_2 \in \mathcal{C}(G_1 \cup G_2; d_1 + d_2)$. This leads to the following lemma.

Lemma 10. Let $G$ be an RTG and let $G_1, G_2$ be vertex-disjoint RTGs. Then

(i) $\mathcal{C}(G) = \bigcup_d \mathcal{C}(G; d)$ and

(ii) $\mathcal{C}(G_1 \cup G_2; d) = \bigcup_{d'} (\mathcal{C}(G_1; d') \otimes \mathcal{C}(G_2; d - d'))$.

Proof. Equation (i) follows immediately from the definition of $\mathcal{C}(G; d)$. For
Equation (ii), let $G = G_1 \cup G_2$ and observe that if $C_1 \in \mathcal{C}(G_1; d')$ and $C_2 \in \mathcal{C}(G_2; d - d')$, then $C = C_1 \cup C_2$ is
a copy set of $G$ and by Lemma 9(i) $\mathrm{diff}(G - C) = \mathrm{diff}((G_1 - C_1) \cup (G_2 - C_2)) =
\mathrm{diff}(G_1 - C_1) + \mathrm{diff}(G_2 - C_2) = d' + d - d' = d$, and hence $C_1 \cup C_2 \in \mathcal{C}(G; d)$.
Conversely, if $C \in \mathcal{C}(G; d)$, define $C_i = C \cap E_i$ where $E_i$ is the edge set of $G_i$
for $i = 1, 2$. Let $d' = \mathrm{diff}(G_1 - C_1)$. As above, it follows from Lemma 9(i) that
$d = \mathrm{diff}(G - C) = \mathrm{diff}(G_1 - C_1) + \mathrm{diff}(G_2 - C_2) = d' + \mathrm{diff}(G_2 - C_2)$, and hence
$\mathrm{diff}(G_2 - C_2) = d - d'$. Thus $C \in \mathcal{C}(G_1; d') \otimes \mathcal{C}(G_2; d - d')$. □

By further exploiting the signature-linearity of cost, we also get
$\mathrm{cost}((G_1 \cup G_2) - (C_1 \cup C_2)) = \mathrm{cost}(G_1 - C_1) + \mathrm{cost}(G_2 - C_2)$, allowing us to compute the cost of
copy sets formed by the union of copy sets of vertex-disjoint graphs.


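Lemma 11. Let $G_1, G_2$ be two vertex-disjoint RTGs and let $G = G_1 \cup G_2$. Then
$T_G[d] = \min_{d'} \{T_{G_1}[d'] + T_{G_2}[d - d']\}$.

Proof. Applying the definition of $T_G[\cdot]$ as well as Lemma 10(ii) and Lemma 9(i) yields

$$\begin{aligned}
T_G[d] &= \min_{C \in \mathcal{C}(G; d)} \mathrm{cost}(G - C)
 = \min_{C \in \bigcup_{d'} (\mathcal{C}(G_1; d') \otimes \mathcal{C}(G_2; d - d'))} \mathrm{cost}(G - C) \\
&= \min_{d'} \; \min_{C \in \mathcal{C}(G_1; d') \otimes \mathcal{C}(G_2; d - d')} \mathrm{cost}(G - C) \\
&= \min_{d'} \Bigl\{ \min_{C_1 \in \mathcal{C}(G_1; d')} \mathrm{cost}(G_1 - C_1) + \min_{C_2 \in \mathcal{C}(G_2; d - d')} \mathrm{cost}(G_2 - C_2) \Bigr\} \\
&= \min_{d'} \{ T_{G_1}[d'] + T_{G_2}[d - d'] \}.
\end{aligned}$$

□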
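The recurrence of Lemma 11 is a min-plus combination of two tables. A minimal Python sketch, assuming that a table is stored as a dictionary holding only its finite entries (a missing key stands for $\infty$); the function names are hypothetical:

```python
from fractions import Fraction


def combine(t1, t2):
    """Table of a disjoint union: T[d] = min over d' of t1[d'] + t2[d - d']."""
    result = {}
    for d1, c1 in t1.items():
        for d2, c2 in t2.items():
            d = d1 + d2
            if d not in result or c1 + c2 < result[d]:
                result[d] = c1 + c2
    return result


def combine_all(tables):
    """Fold any number of component tables, starting from the empty graph."""
    result = {0: Fraction(0)}
    for table in tables:
        result = combine(result, table)
    return result


# Hypothetical component tables, chosen only to exercise the minimum.
t_a = {0: Fraction(2), 1: Fraction(1, 2)}
t_b = {0: Fraction(0), -1: Fraction(1, 2)}
t_ab = combine(t_a, t_b)
```

Iterating over the finite entries of both tables takes time proportional to the product of the table sizes, which is the quantity bounded in the running time analysis.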


By iteratively applying Lemma 11 we compute $T_G[\cdot]$ for a disconnected RTG $G$
with an arbitrary number of connected components. In the following, we analyze
the running time needed for the combination of all tables $T_{G_i}[\cdot]$ for the components $G_i$
of $G$.


Lemma 12. Let $G$ be an RTG with $n$ vertices and connected components $G_1, \dots, G_k$.
Given the tables $T_{G_i}[\cdot]$ for $i = 1, \dots, k$, the table $T_G[\cdot]$ can be computed in $O(n^2)$ time.

Proof. Let $n_i$ denote the number of vertices of $G_i$. For two graphs $H_1$ and $H_2$ with
$h_1$ and $h_2$ vertices, respectively, computing $T_{H_1 \cup H_2}[\cdot]$ according to Lemma 11 takes
time $O(h_1 \cdot h_2)$ and the table size is $O(h_1 + h_2)$. Thus, iteratively combining the
table for $G_{i+1}$ with the table for $\bigcup_{j=1}^{i} G_j$ takes time $O(\sum_{i=1}^{k-1} n_{i+1} \sum_{j=1}^{i} n_j)$. We have
$\sum_{i=1}^{k-1} n_{i+1} \sum_{j=1}^{i} n_j \le \sum_{i=1}^{k-1} n_{i+1} n = n \sum_{i=1}^{k-1} n_{i+1} \le n^2$. Hence, the running time
is $O(n^2)$. □

4.2 Tree RTGs


For a tree RTG $G$, we compute $T_G[\cdot]$ in a bottom-up fashion. The direction of the edges
naturally defines a unique root vertex $r$ that has no incoming edges and we consider $G$
as a rooted tree. For a vertex $v$, we denote by $G(v)$ the subtree of $G$ with root $v$. Let $v$
be a vertex with children $v_1, \dots, v_k$.

What does a copy set $C$ of $G(v)$ look like? Clearly, $G(v) - C$ contains precisely
one of the outgoing edges of $v$, say $(v, v_j)$. Then $Z_j = \{(v, v_i) \mid i \ne j\} \subseteq C$. Graph
$G(v) - Z_j$ has connected components $G(v_i)$ for $i \ne j$, whose union we denote $G_{\ne j}$,
and one additional connected component $G^+(v_j)$ that is obtained from $G(v_j)$ by adding
the vertex $v$ and the edge $(v, v_j)$. This forms a proper partition of $G(v) - Z_j$. As above,
we decompose the copy set $C - Z_j$ further into a union of a copy set $C_{\ne j}$ of $G_{\ne j}$ and a
copy set $C_j$ of $G^+(v_j)$. Graph $G_{\ne j}$ is disconnected and can be handled as above. Note
that the only child of the root of $G^+(v_j)$ is $v_j$ and hence $C_j$ is a copy set of $G(v_j)$.

For expressing the cost and difference measures for copy sets of $G^+(v_j)$ in terms
of copy sets of $G(v_j)$, we use the correction terms $\Delta_{\mathrm{cost}}$ and $\Delta_{\mathrm{diff}}$. By Lemma 9(ii),
$\mathrm{diff}(G^+(v_j) - C_j) = \mathrm{diff}(G(v_j) - C_j) + \Delta_{\mathrm{diff}}(s)$, where $s$ is the size of the root
path $P(v_j, C_j)$ of $G(v_j) - C_j$, i.e., the size of the connected component of $G(v_j) - C_j$
containing $v_j$. An analogous statement holds for cost. More precisely, it suffices to
know $s$ modulo 4. Therefore, we further decompose our copy sets as follows, which
allows us to formalize our discussion.

Definition 2. For a tree RTG $G$ with root $v$ and children $v_1, \dots, v_k$, we define
$\mathcal{C}(G; d, s) = \{C \in \mathcal{C}(G; d) \mid |P(v, C)| \equiv s \pmod 4\}$. We further decompose these by
$\mathcal{C}(G; d, s, j) = \{C \in \mathcal{C}(G; d, s) \mid (v, v_j) \notin C\}$, according to which outgoing edge of
the root is not in the copy set.

The following lemma gives calculation rules for composing copy sets. 

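Lemma 13. Let $G$ be a tree RTG with root $v$ and children $v_1, \dots, v_k$ and for a fixed
vertex $v_j$, $1 \le j \le k$, let $G^+(v_j)$ be the subgraph of $G$ induced by the vertices in $G(v_j)$
together with $v$. Let further $G_{\ne j} = \bigcup_{i \ne j} G(v_i)$ and $Z_j = \{(v, v_i) \mid i \ne j\}$. Then

(i) $\mathcal{C}(G; d) = \bigcup_{s=0}^{3} \mathcal{C}(G; d, s)$ and $\mathcal{C}(G; d, s) = \bigcup_{j=1}^{k} \mathcal{C}(G; d, s, j)$.

(ii) $\mathcal{C}(G^+(v_j); d, s) = \mathcal{C}(G(v_j); d - \Delta_{\mathrm{diff}}(s), s - 1)$.

(iii) $\mathcal{C}(G; d, s, j) = \bigcup_{d'} (\mathcal{C}(G_{\ne j}; d') \otimes \mathcal{C}(G^+(v_j); d - d', s) \otimes \{Z_j\})$.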

Proof. The statements in (i) follow immediately from the definitions of $\mathcal{C}(G; d, s)$ and
$\mathcal{C}(G; d, s, j)$. We continue with Statement (ii). Since $v$ in $G^+(v_j)$ has only one child $v_j$,
the edge $(v, v_j)$ is not in any copy set of $G^+(v_j)$. Therefore, the copy sets in $\mathcal{C}(G^+(v_j))$
and $\mathcal{C}(G(v_j))$ are in one-to-one correspondence. We need to understand how the partition
into copy sets with difference measure $d$ and root path length $s$ (modulo 4) respects
this bijection. Let $s$ be the root path size of $G^+(v_j) - C$ for a copy set $C \in \mathcal{C}(G^+(v_j))$.
Obviously, $|P(G(v_j) - C)| = |P(G^+(v_j) - C)| - 1 = s - 1$. Moreover, going from
$G^+(v_j) - C$ to $G(v_j) - C$ replaces a connected component of size $s$ by one of size
$s - 1$. Therefore $\mathrm{sig}(G(v_j) - C) = \mathrm{sig}(G^+(v_j) - C) - \mathrm{sig}(P_s) + \mathrm{sig}(P_{s-1})$. By the
signature-linearity of diff, we have $\mathrm{diff}(G(v_j) - C) = \mathrm{diff}(G^+(v_j) - C) - \Delta_{\mathrm{diff}}(s)$.
Note further that $\Delta_{\mathrm{diff}}(s) = \Delta_{\mathrm{diff}}(s + 4)$ for every value of $s$, and hence it suffices to
know $s \bmod 4$. Overall, it follows that a copy set $C \in \mathcal{C}(G^+(v_j); d, s)$ is a copy set
of $G(v_j)$ with difference measure $\mathrm{diff}(G^+(v_j) - C) - \Delta_{\mathrm{diff}}(s)$ and root path size
modulo 4 being $s - 1$. Thus $C \in \mathcal{C}(G(v_j); d - \Delta_{\mathrm{diff}}(s), s - 1)$. And conversely,
$C \in \mathcal{C}(G(v_j); d - \Delta_{\mathrm{diff}}(s), s - 1)$ satisfies $C \in \mathcal{C}(G^+(v_j); d, s)$.

Next, we consider Statement (iii). First, observe that the copy sets $C$ of $G$ whose
root path starts with $(v, v_j)$ are exactly those copy sets of $G$ that contain all edges in $Z_j$.
These sets correspond bijectively to copy sets of $G - Z_j$. Thus $\mathcal{C}(G; d, s, j) =
\mathcal{C}(G - Z_j; d, s) \otimes \{Z_j\}$. Observe that $G - Z_j = G_{\ne j} \cup G^+(v_j)$ is a proper partition of $G - Z_j$.
Furthermore, the root path of any copy set of this graph lies in $G^+(v_j)$. Therefore,
Lemma 10(ii) implies that $\mathcal{C}(G - Z_j; d, s) = \bigcup_{d'} (\mathcal{C}(G_{\ne j}; d') \otimes \mathcal{C}(G^+(v_j); d - d', s))$.
Combining this with the previously derived description of $\mathcal{C}(G; d, s, j)$ yields
Statement (iii). □

To make use of this decomposition of copy sets, we extend our table $T$ with an
additional parameter $s$ to keep track of the size of the root path modulo 4. We call the
resulting table $\tilde{T}$. More formally, $\tilde{T}_v[d, s] = \min_{C \in \mathcal{C}(G(v); d, s)} \mathrm{cost}(G(v) - C)$. It is not
hard to see that $T_G[\cdot]$ can be computed from $\tilde{T}_r[\cdot, \cdot]$ for the root $r$ of a tree RTG $G$.

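Lemma 14. Let $G$ be a tree RTG with root $r$. Then $T_G[d] = \min_s \tilde{T}_r[d, s]$.

Proof. Using the definitions of $T_G[\cdot]$ and $\tilde{T}_r[\cdot, \cdot]$, we obtain

$$T_G[d] = \min_{C \in \mathcal{C}(G; d)} \mathrm{cost}(G - C) = \min_{s \in \{0, \dots, 3\}} \; \min_{C \in \mathcal{C}(G; d, s)} \mathrm{cost}(G - C) = \min_{s \in \{0, \dots, 3\}} \tilde{T}_r[d, s].$$

□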


To compute $\tilde{T}_v[\cdot, \cdot]$ in a bottom-up fashion, we exploit the decompositions from
Lemma 13 and the fact that we can update the cost function from $G(v_j) - C_j$ to
$G^+(v_j) - C_j$ using the correction term $\Delta_{\mathrm{cost}}$. The proof is similar to that of Lemma 11.

Lemma 15. Let $G$ be a tree RTG, let $v$ be a vertex of $G$ with children $v_1, \dots, v_k$, and let
$G(v_i) = (V_i, E_i)$ for $i = 1, \dots, k$. Let further $G_{\ne j} = (V_{\ne j}, E_{\ne j}) = \bigcup_{i=1, i \ne j}^{k} G(v_i)$.
Then the following equation holds.

$$\tilde{T}_v[d, s] = \min_{j \in \{1, \dots, k\}} \min_{d'} \; T_{G_{\ne j}}[d'] + \tilde{T}_{v_j}[d - d' - \Delta_{\mathrm{diff}}(s), (s - 1) \bmod 4] + \Delta_{\mathrm{cost}}(s)$$


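Proof. According to the definition of $\tilde{T}_v[d, s]$ and Lemma 13(i), we find that

$$\tilde{T}_v[d, s] = \min_{C \in \mathcal{C}(G; d, s)} \mathrm{cost}(G - C) = \min_{j} \; \min_{C \in \mathcal{C}(G; d, s, j)} \mathrm{cost}(G - C). \tag{1}$$

Using Lemma 13(iii) yields

$$\min_{C \in \mathcal{C}(G; d, s, j)} \mathrm{cost}(G - C) = \min_{d'} \; \min_{\substack{X \in \mathcal{C}(G_{\ne j}; d') \\ Y \in \mathcal{C}(G^+(v_j); d - d', s)}} \mathrm{cost}(G - X - Y - Z_j). \tag{2}$$

Note that $G - Z_j = G_{\ne j} \cup G^+(v_j)$. By Lemma 10 we have that for $X \in \mathcal{C}(G_{\ne j}; d')$, $Y \in
\mathcal{C}(G^+(v_j); d - d', s)$, it is $\mathrm{cost}(G - X - Y - Z_j) = \mathrm{cost}(G_{\ne j} \cup G^+(v_j) - X - Y) =
\mathrm{cost}(G_{\ne j} - X) + \mathrm{cost}(G^+(v_j) - Y)$. Therefore,

$$\min_{\substack{X \in \mathcal{C}(G_{\ne j}; d') \\ Y \in \mathcal{C}(G^+(v_j); d - d', s)}} \mathrm{cost}(G - X - Y - Z_j)
 = \min_{X \in \mathcal{C}(G_{\ne j}; d')} \mathrm{cost}(G_{\ne j} - X) + \min_{Y \in \mathcal{C}(G^+(v_j); d - d', s)} \mathrm{cost}(G^+(v_j) - Y). \tag{3}$$

By definition $\min_{X \in \mathcal{C}(G_{\ne j}; d')} \mathrm{cost}(G_{\ne j} - X) = T_{G_{\ne j}}[d']$. Furthermore, $G^+(v_j)$ is a
tree RTG whose root $v$ has the single child $v_j$. Hence, by Lemma 13(ii) and Lemma 9(ii),
we find

$$\begin{aligned}
\min_{Y \in \mathcal{C}(G^+(v_j); d - d', s)} \mathrm{cost}(G^+(v_j) - Y)
 &= \min_{Y \in \mathcal{C}(G(v_j); d - d' - \Delta_{\mathrm{diff}}(s), s - 1)} \mathrm{cost}(G(v_j) - Y) + \Delta_{\mathrm{cost}}(s) \\
 &= \tilde{T}_{v_j}[d - d' - \Delta_{\mathrm{diff}}(s), s - 1] + \Delta_{\mathrm{cost}}(s).
\end{aligned} \tag{4}$$

Combining Equations (1)-(4) yields the claim. □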


For leaves of a tree RTG $G$, $\tilde{T}_v[0, 1] = 0$ and all other entries are $\infty$. We compute
$T_G[\cdot]$ by iteratively applying Lemma 15 in a bottom-up fashion, using Lemma 14 to
compute $T[\cdot]$ from $\tilde{T}[\cdot, \cdot]$ in linear time when needed.
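The base case and the projection of Lemma 14 can be sketched in a few lines of Python. The sketch assumes that the extended table is stored as a dictionary from pairs $(d, s)$ to costs, holding only finite entries; the names `leaf_table` and `project` are hypothetical, and the second example table is made up purely for illustration.

```python
from fractions import Fraction


def leaf_table():
    """Extended table of a leaf: the only finite entry is [0, 1] = 0."""
    return {(0, 1): Fraction(0)}


def project(extended):
    """Lemma 14: T[d] is the minimum of the extended table over s."""
    table = {}
    for (d, s), cost in extended.items():
        if d not in table or cost < table[d]:
            table[d] = cost
    return table


# A made-up extended table with two root path classes for d = 0.
example = {(0, 1): Fraction(1), (0, 3): Fraction(1, 2), (1, 2): Fraction(1, 2)}
projected = project(example)
```

Since the second parameter only takes four values, the projection touches each entry once and runs in time linear in the table size.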


Lemma 16. Let $G = (V, E)$ be a tree RTG with $n$ vertices and root $r$. The tables
$\tilde{T}_r[\cdot, \cdot]$ and $T_G[\cdot]$ can be computed in $O(n^3)$ time.

Proof. First observe that given $\tilde{T}_v[\cdot, \cdot]$ for $v \in V$, table $T_{G(v)}[\cdot]$ can be computed in
linear time according to Lemma 14. In particular, $T_G[\cdot]$ can be computed from $\tilde{T}_r[\cdot, \cdot]$
in linear time.

We now bound the computation time for $\tilde{T}_v[\cdot, \cdot]$. Let $v \in V$ with children $v_1, \dots, v_k$.
Given the tables $\tilde{T}_{v_i}[\cdot, \cdot]$, we can compute $\tilde{T}_v[\cdot, \cdot]$ by Lemma 15. More precisely, for
each $j = 1, \dots, k$, we first compute $T_{G_{\ne j}}[\cdot]$ in quadratic time by Lemma 12, followed
by $O(n)$ table lookups, one for each value of $d'$. Hence, processing $v$ takes time
$O(\deg^+(v) \cdot n^2)$. Since $\sum_{v \in V} \deg^+(v) = n - 1$, the total processing time to compute
$\tilde{T}_r[\cdot, \cdot]$ in a bottom-up fashion is $O(n^3)$. □


4.3 Connected RTGs Containing a Cycle 

We now look at connected RTGs that contain a cycle. We first introduce an additional 
decomposition for copy sets to simplify the following calculations. 

Lemma 17. Let $G = (V, E)$ be a connected RTG containing a directed cycle $K$ and
let $e_1, \dots, e_k$ denote the edges of $K$ whose source has out-degree at least 2. Let further
$O = \{(u, v) \in E \mid u \in K, (u, v) \notin K\}$. Then

$$\mathcal{C}(G; d) = \mathcal{C}(G - O; d) \otimes \{O\} \;\cup\; \bigcup_{i=1}^{k} \mathcal{C}(G - e_i; d) \otimes \{\{e_i\}\}.$$

Proof. Every copy set $C \in \mathcal{C}(G; d)$ contains either some edge of $K$ or it contains all
edges in $O$. Note that edges of $K$ that are not among $e_1, \dots, e_k$ are not contained in
any copy set. Thus, in the former case, $e_i \in C$ for some $i \in \{1, \dots, k\}$ and hence
$C \in \mathcal{C}(G - e_i; d) \otimes \{\{e_i\}\}$. In the latter case $C \setminus O$ is a copy set of $G - O$, hence
$C \in \mathcal{C}(G - O; d) \otimes \{O\}$. Conversely, any copy set in $\mathcal{C}(G - O; d) \otimes \{O\}$ forms a copy
set of $G$ and also every copy set in $\mathcal{C}(G - e_i; d) \otimes \{\{e_i\}\}$ for any value of $i$ forms a
copy set of $G$. This finishes the proof. □

As before, this decomposition can be used to efficiently compute $T_G[\cdot]$ from the
tables of smaller subgraphs of a connected RTG $G$ containing a cycle.

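Lemma 18. Let $G = (V, E)$ be a connected RTG containing a directed cycle $K$ and
let $e_1, \dots, e_k$ denote the edges of $K$ whose source has out-degree at least 2. Let further
$O = \{(u, v) \in E \mid u \in K, (u, v) \notin K\}$. Then

$$T_G[d] = \min\Bigl\{ T_{G - O}[d], \; \min_{i = 1, \dots, k} T_{G - e_i}[d] \Bigr\}.$$

Proof. Using the definition of $T_G[\cdot]$ and Lemma 17 we find that

$$T_G[d] = \min_{C \in \mathcal{C}(G; d)} \mathrm{cost}(G - C) = \min_{C \in (\mathcal{C}(G - O; d) \otimes \{O\}) \cup \bigcup_{i=1}^{k} (\mathcal{C}(G - e_i; d) \otimes \{\{e_i\}\})} \mathrm{cost}(G - C).$$

As we minimize cost over a union of sets, we can minimize it over the sets individually
and then take the minimum of the results. Hence, we find that

$$\min_{C \in \mathcal{C}(G - O; d) \otimes \{O\}} \mathrm{cost}(G - C) = \min_{C \in \mathcal{C}(G - O; d)} \mathrm{cost}(G - O - C) = T_{G - O}[d]$$

and

$$\min_{C \in \mathcal{C}(G - e_i; d) \otimes \{\{e_i\}\}} \mathrm{cost}(G - C) = \min_{C \in \mathcal{C}(G - e_i; d)} \mathrm{cost}(G - e_i - C) = T_{G - e_i}[d],$$

which together yield the claim. □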
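In terms of tables, Lemma 18 is an entrywise minimum. The Python sketch below assumes tables stored as dictionaries of finite entries and uses a tiny hand-computed instance: the cycle on vertices 0 and 1 with one extra edge $(0, 2)$, evaluated with $\mathrm{cost}^1$ under the assumption that a component of size 2 contributes to $a_2$ and one of size 3 to $a_3$. The function name is hypothetical.

```python
from fractions import Fraction


def entrywise_min(*tables):
    """T[d] = minimum of the given tables at d; missing entries are infinite."""
    result = {}
    for table in tables:
        for d, cost in table.items():
            if d not in result or cost < result[d]:
                result[d] = cost
    return result


# G has edges (0,1), (1,0), (0,2); K is the cycle 0 -> 1 -> 0.
# O = {(0,2)}: G - O is the 2-cycle plus an isolated vertex, diff = 1.
t_minus_O = {1: Fraction(1, 2)}
# e_1 = (0,1): G - e_1 is the path 1 -> 0 -> 2 on three vertices, diff = -1.
t_minus_e1 = {-1: Fraction(1, 2)}

t_G = entrywise_min(t_minus_O, t_minus_e1)
```

Both copy sets of this instance lead to a relaxed cost of one half, so either choice yields one permutation operation after rounding.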

Lemma 19. Let $G = (V, E)$ be a connected RTG containing a directed cycle. The
table $T_G[\cdot]$ can be computed in $O(n^4)$ time.

Proof. Let $e_1, \dots, e_k$ be the edges of the cycle $K$. First, observe that $G - e_i$ is a tree for
$i = 1, \dots, k$. Hence, we can compute each table $T_{G - e_i}[\cdot]$ in $O(n^3)$ time by Lemma 16.
Thus, computing all these tables takes $O(n^4)$ time.

Second, let $O = \{(u, v) \in E \mid u \in K, (u, v) \notin K\}$. The graph $G - O$ is the disjoint
union of the cycle $K$ and several tree RTGs $G_1, \dots, G_t$. The table $T_K[\cdot]$ has only one
finite entry and can be computed in constant time. The tables $T_{G_i}[\cdot]$ can be computed
in $O(n^3)$ time. Using Lemma 12 we then compute $T_{G - O}[\cdot]$ in quadratic time.

With these tables available, we can compute $T_G[\cdot]$ according to Lemma 18. This
takes $O(n^2)$ time. The overall running time is thus $O(n^4)$. □

4.4 Putting Things Together


To compute $T_G[\cdot]$ for an arbitrary RTG $G$, we first compute $T_K[\cdot]$ for each connected
component $K$ of $G$ using Lemmas 16 and 19. Then, we compute $T_G[\cdot]$ using Lemma 12
and the length of an optimal shuffle code using Lemma 8. To actually compute the
shuffle code, we augment the dynamic program computing $T_G[\cdot]$ such that an optimal
copy set $C$ can be found by backtracking in the tables. An optimal shuffle code is then
constructed by applying Greedy to $G - C$ and adding one copy operation for each
edge in $C$.

Theorem 3. Given an RTG $G$, an optimal shuffle code can be computed in $O(n^4)$ time.

Proof. We compute all tables $T_K[\cdot]$, where $K$ is a connected component of $G$, in $O(n^4)$
time using Lemma 16 and Lemma 19. Using Lemma 12 we then compute $T_G[\cdot]$ in
$O(n^2)$ time. From this, we can compute the length of an optimal shuffle code by
Lemma 8.

In fact, it is not difficult to modify the dynamic program in a way that, given an
entry $T_G[d]$, a corresponding copy set $C$ of $G$ with $\mathrm{cost}(G - C) = T_G[d]$ can be
computed by backtracking in the tables. Hence, to compute an optimal shuffle code for
$G$, we first compute an optimal copy set $C_{\mathrm{opt}}$ of $G$ in $O(n^4)$ time. Then, we compute an
optimal shuffle code $\pi_1, \dots, \pi_k$ for $G - C_{\mathrm{opt}}$ using Greedy, which takes linear time
according to Theorem 2. Let $\pi = \pi_k \circ \dots \circ \pi_1$. For each edge $(u, v) \in C_{\mathrm{opt}}$, we define
a corresponding copy operation $\pi(u) \to v$. Let $c_1, \dots, c_t$ be these copy operations
in arbitrary order. Then the sequence $S = \pi_1, \dots, \pi_k, c_1, \dots, c_t$ is an optimal shuffle
code. This can be seen as follows. First, by Lemma 8, the length of $S$ is minimal. It
remains to show that $S$ is indeed a shuffle code for $G$. This is clearly true, as it first
shuffles the values in the registers so that a subset of the values is in the correct position
and then uses copy operations to transfer the remaining values to their destinations. □
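The construction in the proof can be traced on a small instance with the following Python sketch. It is an illustration only: the permutation is chosen by hand (a single 4-cycle, which fits into one operation on up to five registers) rather than produced by Greedy, a permutation is represented as a hypothetical dictionary from source register to destination register, and all function names are assumptions.

```python
def apply_permutation(regs, perm):
    """Move the content of each source register to its destination register."""
    old = dict(regs)
    for src, dst in perm.items():
        regs[dst] = old[src]


def copy_operations(perms, copy_set):
    """For each copy-set edge (u, v), copy from pi(u) to v, pi = pi_k o ... o pi_1."""
    def pi(u):
        for perm in perms:  # pi_1 is applied first
            u = perm.get(u, u)
        return u
    return [(pi(u), v) for (u, v) in copy_set]


def run(regs, perms, copies):
    """Execute the permutations followed by the copy operations."""
    regs = dict(regs)
    for perm in perms:
        apply_permutation(regs, perm)
    for src, dst in copies:
        regs[dst] = regs[src]
    return regs


# RTG: an edge (u, v) means the old content of u must end up in v.
edges = [(0, 1), (0, 2), (2, 3), (2, 4), (4, 5)]
copy_set = [(0, 1), (2, 3)]               # keeps the path 0 -> 2 -> 4 -> 5
perms = [{0: 2, 2: 4, 4: 5, 5: 0}]        # path completed into a 4-cycle
copies = copy_operations(perms, copy_set)

initial = {r: f"v{r}" for r in range(6)}
final = run(initial, perms, copies)
```

After the permutation, the old content of register 0 sits in register 2 and the old content of register 2 sits in register 4, so the copies read from registers 2 and 4. The resulting sequence has length 3, one permutation plus two copies.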


5 Conclusion 

We have presented an efficient algorithm for generating optimal shuffle code using copy
instructions and permutation instructions, which can arbitrarily permute the contents
of up to five registers. As an intermediate result, we have proven the optimality
of the greedy algorithm for factoring a permutation into a minimal product of permutations,
each of which permutes up to five elements. It would be interesting to allow
permutations of larger size.